Design of Area-Power Efficient Parallel Fir Filter with Mux Based Full Adder

Authors: Sameena Mohammad, K. Rama Devi

DOI Link: https://doi.org/10.22214/ijraset.2022.46913

Abstract

In this article, we propose a new hardware architecture for a high-speed finite impulse response (FIR) filter using seamless fine-grained pipelining. The proposed full-parallel, fully pipelined FIR filter can generate an output sample within a limited number of gate delays by placing pipeline registers both between components and across them. A precise critical path analysis at the gate level makes it possible to choose a suitable pipelining strategy depending on the required throughput. This paper also presents a modified full adder based on multiplexers, which establishes trade-offs in terms of area, power and delay. Advanced FIR filter variants are incorporated to assess the balance between complexity, speed and maximum throughput. The proposed design is synthesized and simulated using Xilinx Vivado 2018.3 and shows improvements in area, delay and power consumption.

I. INTRODUCTION

A digital multiplier is a fundamental unit in digital signal processors and in general-purpose microprocessors. The multiplication operation is easy to find in signal processing and matrix arithmetic algorithms. Multipliers are usually implemented using the modified Booth algorithm (MBA), which requires moderate hardware [4] but results in unsuitably long delays. A high-speed multiply-accumulate structure based on the Baugh-Wooley algorithm (BWA) has been successfully developed and applied to several digital filtering applications.

A. Partial-product Generation (PPG)

Partial products can be generated using several techniques, such as the BWA, the Booth algorithm (BA) or the MBA. In addition to the encoding step, the MBA and the BA also require generating the two's complement of the multiplier, which introduces extra delay.

B. Partial-product Addition (PPA)

For serial-parallel multipliers, this module can be implemented with a ripple carry adder. For parallel multipliers, the addition is mainly performed using Wallace trees, carry-save techniques, etc. In this paper, we employ a more regular neighbour-to-neighbour interconnection pattern while achieving the same result.

C. Final Adder

Once the partial products have been reduced to sum and carry words, a final adder is required to generate the result of the multiplication. Generally, authors have simply used 4-bit carry look-ahead adders (CLAs).

If the timing constraint is loose enough that a full-parallel architecture is not required, it may be necessary to reduce the number of processing elements. In this paper, we propose two alternative architectures obtained by modifying the reference FPFF: the single-MAC FIR filter (SMFF) and the folded FIR filter (FDFF). They compensate for the increase in propagation delay due to recursive computations by appropriately using compressors designed to prevent carry propagation, together with separate clock sources. In summary, the contributions of this paper are as follows:

A novel high-speed full-parallel architecture of the FIR filter based on MBE and Wallace tree is designed at the gate level.
A hierarchical Wallace tree network is employed to facilitate the design and pipelining procedure.
A method to find the critical path of the proposed architecture is provided.
Two alternative architectural options, the SMFF and the FDFF, are proposed.
Practical design examples based on the proposed structure and analysis are provided with the synthesis results.

A multiplier usually has three steps. The partial product generation (PPG) process is the first step. For example, AND gates can be used to generate the partial product matrix (PPM) for an unsigned multiplication. The partial product reduction (PPR) process is the second step. Using the Dadda tree approach or the Wallace tree approach, the PPM can be reduced to two rows. The final addition process is the third step.
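The AND-gate partial product generation described above can be sketched in software as follows. This is only an illustrative model, not the authors' hardware description; the function names `partial_product_matrix` and `multiply_unsigned` are hypothetical, and the reduction and final addition steps are collapsed into a plain sum.

```python
def partial_product_matrix(a: int, b: int, m: int) -> list[int]:
    """Return the m partial-product rows of an unsigned m-bit multiplication.

    Bit j of row i is the AND of multiplicand bit a_j and multiplier bit b_i,
    placed at weight 2**(i + j).
    """
    rows = []
    for i in range(m):
        b_i = (b >> i) & 1
        row = 0
        for j in range(m):
            a_j = (a >> j) & 1
            row |= (a_j & b_i) << (i + j)  # one AND gate per matrix entry
        rows.append(row)
    return rows


def multiply_unsigned(a: int, b: int, m: int) -> int:
    # Steps 2 and 3 (reduction and final addition) are collapsed into sum()
    # here; in hardware they would be a reduction tree and a final adder.
    return sum(partial_product_matrix(a, b, m))
```

For a 4-bit multiplication the matrix has four rows, and adding them gives the product.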

For MB recoding, at least three signals are needed to represent the digit set {-2, -1, 0, 1, 2}. Many different encoding schemes have been developed, and Table I shows the previously proposed encoding scheme that is adopted to implement the proposed MBE multiplier.
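As an illustration of how the digit set {-2, -1, 0, 1, 2} arises, the following sketch performs the standard radix-4 modified Booth recoding of a two's complement multiplier. It models only the arithmetic, not the three-signal encoding of Table I; the function names are hypothetical.

```python
def booth_radix4_digits(y: int, m: int) -> list[int]:
    """Recode an m-bit two's complement multiplier (m even) into m/2 digits
    from {-2, -1, 0, 1, 2}, least significant digit first."""
    assert m % 2 == 0, "m is assumed to be even"
    y &= (1 << m) - 1          # keep the m-bit two's complement pattern
    digits = []
    prev = 0                   # implicit bit to the right of the LSB is 0
    for i in range(0, m, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, m: int) -> int:
    # Each digit selects 0, +/-x or +/-2x, weighted by 4**k: m/2 PPs in total.
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, m)))
```

Each digit corresponds to one partial product, which is why an m-bit multiplier yields m/2 partial products.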

II. Existing Method

In the existing method, an FIR filter performs multiplication with modified Booth encoders, accumulates the products of all taps, and thereby obtains the final output. Filter design can be defined as the process of choosing the length and coefficients of the filter [4].
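For reference, the per-tap multiplication and accumulation described above corresponds to the standard FIR filter equation. The symbols $h_k$ (the k-th coefficient), $x(n)$ (the input sample) and $K$ (the number of taps) are notation assumed here for illustration; only the output $y(n)$ is named in this paper.

```latex
y(n) = \sum_{k=0}^{K-1} h_k \, x(n-k)
```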

Finite impulse response (FIR) filters are the most popular type of filter and are simple to implement in software. Filters are also known as signal conditioners.

Each filter functions by accepting an input signal, blocking pre-specified frequency components, and passing the original signal minus those components to the output. The architecture of the reference FPFF consists of modified Booth encoders, a hierarchical Wallace reduction tree (WRT) network, and a ripple carry adder (RCA). A Booth encoder is widely used as the multiplier of an FIR filter; in particular, the modified Booth encoder (MBE) is one of the most efficient among existing Booth encoders and results in the smallest number of partial products (PPs).

In the reference K-tap FPFF, a total of K MBEs are used, one per tap. Assuming m is even, multiplying the m-bit input by the m-bit coefficient using an MBE yields a total of m/2 PPs. The WRT is responsible for reducing the PPs generated by the MBEs to two PPs. The key task of the WRT is to group 2 or 3 bits at the same bit position and use a half adder (HA) or a full adder (FA) to reduce them to 2 bits over 2 consecutive bit positions. This process continues over several levels within the WRT until two PPs are left.
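The reduction to two PPs can be sketched at the word level as follows. This is a simplified model under stated assumptions: whole rows are grouped three at a time and compressed by a row of FAs, leftover rows pass to the next level unchanged, and HAs are omitted. The names `compress_3_to_2` and `wallace_reduce` are hypothetical.

```python
def compress_3_to_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Row of full adders: three rows in, a sum row and a carry row out.

    Every bit position is handled independently, so no carry propagates
    along the row; the carry row is simply shifted to the next position.
    """
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def wallace_reduce(pps: list[int]) -> list[int]:
    """Reduce a list of non-negative partial products to at most two rows."""
    rows = list(pps)
    while len(rows) > 2:
        grouped = len(rows) - len(rows) % 3
        next_rows = []
        for i in range(0, grouped, 3):
            next_rows.extend(compress_3_to_2(*rows[i:i + 3]))
        next_rows.extend(rows[grouped:])  # leftover rows pass through
        rows = next_rows
    return rows
```

The two remaining rows still add up to the sum of all input rows, which is why a single final adder suffices.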

The Booth multiplication algorithm, or Booth algorithm, was invented by Andrew Donald Booth. It is an algorithm for multiplying binary numbers in two's complement notation. It offers a simple way to multiply binary numbers through repeated addition operations. A further modified version of this algorithm is known as the modified Booth algorithm.

As a design example, consider an FIR filter with 16 taps. The first-stage WRT then generates a total of 32 PPs, i.e., 2 PPs per tap. The second-stage WRT is composed of either 2 WRTs with 16 PPs each or 4 WRTs with 8 PPs each. If 2 WRTs with 16 PPs each are selected, 4 PPs are generated by the second-stage WRT. The third-stage WRT receives these 4 PPs from the second-stage WRT and processes them. Finally, 2 PPs are passed to the RCA.

The SMFF performs K-tap filtering through K recursive MAC computations over K clock cycles. Therefore, the area of the SMFF can be significantly reduced at the expense of throughput. The proposed SMFF takes K 16-bit inputs and K 16-bit coefficients. The hierarchical structure of the 10:2 compressor array has a shorter propagation delay than the corresponding 5-level WRT with 10 PPs as input (10 XOR gate delays). However, the WRT has the advantage of a regular structure to which the proposed pipelining technique can be applied. The multiply and accumulate process is repeated K times, and the final two PPs are added using the ripple carry adder. Importantly, this final addition needs to be performed only once, when the output y(n) is finally obtained, rather than once for each tap.
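The recursive computation can be illustrated with a small behavioural sketch. It is a simplification under stated assumptions: each tap's product is treated as a single word rather than as the PPs of an MBE, a 3:2 compression stands in for the 10:2 compressor array, and operands are non-negative. The function name `smff_output` is hypothetical.

```python
def smff_output(x_window: list[int], coeffs: list[int]) -> int:
    """Compute one output y(n) with K recursive MAC steps.

    x_window[k] holds x(n-k) and coeffs[k] holds the k-th coefficient.
    The accumulator is kept in carry-save form (a sum word and a carry
    word), so no carry-propagate addition happens inside the loop.
    """
    assert len(x_window) == len(coeffs)
    acc_sum, acc_carry = 0, 0
    for x_k, h_k in zip(x_window, coeffs):      # one tap per clock cycle
        product = x_k * h_k
        new_sum = acc_sum ^ acc_carry ^ product
        new_carry = ((acc_sum & acc_carry)
                     | (acc_sum & product)
                     | (acc_carry & product)) << 1
        acc_sum, acc_carry = new_sum, new_carry
    # The final carry-propagate addition is performed once per output.
    return acc_sum + acc_carry
```

Keeping the accumulator as two words is what allows the final addition to be deferred until all K taps have been processed.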

The required number of levels depends on the number of PPs. Since the propagation delay of each level is at most one FA delay, the WRT is suitable for speeding up the additions by pipelining inside the adder tree of the FIR filter [7]. This is because the operation of one FA does not depend on the result of other FAs or HAs in the same level. Since one m-bit MBE generally produces m/2 PPs, the total number of PPs in K taps is equal to (m/2) × K. The ripple carry adder (RCA) is used to add the final two PPs, which are the outputs of the last-stage WRT, to create the output of the FIR filter.
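The dependence of the level count on the number of PPs can be sketched as below, assuming each level compresses every complete group of three rows into two and passes leftover rows through unchanged. Both function names are hypothetical.

```python
def wallace_levels(num_pps: int) -> int:
    """Number of reduction levels needed to bring num_pps rows down to two."""
    levels = 0
    while num_pps > 2:
        # Each full group of 3 rows becomes 2 rows; the remainder is kept.
        num_pps = 2 * (num_pps // 3) + num_pps % 3
        levels += 1
    return levels


def total_partial_products(m: int, k: int) -> int:
    """Total PPs produced by K taps, each using one m-bit MBE (m even)."""
    return (m // 2) * k
```

Under this assumption, 10 PPs need 5 levels, and the 8 PPs of a single 16-bit MBE need 4 levels.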

The RCA is the simplest adder, performing m-bit addition with m full adders (FAs) connected in series. Alternative fast adders such as a carry look-ahead adder (CLA) or its variants could be selected, but the RCA is used in this paper in order to keep a regular structure with the WRT, which consists only of bit adders such as FAs.
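A bit-level sketch of an RCA built from m FAs in series is given below. The FA uses the conventional XOR-based logic; the function names are hypothetical.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Conventional full adder: returns (sum bit, carry-out bit)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x: int, y: int, m: int) -> tuple[int, int]:
    """Add two m-bit unsigned words; returns (m-bit sum, carry-out)."""
    carry = 0
    result = 0
    for i in range(m):  # the carry ripples from bit 0 up to bit m-1
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

Because each FA waits for the carry of the previous one, the delay grows linearly with m, which is the price paid for the regular structure.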

III. Design of Proposed FIR Filter

The main aim of this paper is to improve area, power and delay by replacing the conventional full adder with a MUX-based full adder. In the existing method, the XOR-based full adder offers only average area, power and delay, so a multiplexer-based full adder is used here instead. Area, power and delay are evaluated for the FIR filter as well as for its SMFF and FDFF variants [8]. There are many paths from each input bit to each output bit, among which the critical path, the one with the longest propagation delay, must be found through analysis. The propagation delay is estimated by simplifying the logic circuit so that it consists only of 4:1 MUXes. The delay of logic inversion is ignored. The logic circuit to be synthesized is reasonably predicted based on the logic expression used to obtain each variable.

A multiplexer (MUX) is a combinational circuit that has a maximum of 2^n data inputs, n selection lines and a single output line. One of the data inputs is connected to the output based on the values of the selection lines. A 4×1 multiplexer has four data inputs I3, I2, I1 and I0, two selection lines S1 and S0, and one output Y. The block diagram of the 4×1 multiplexer is shown in the following figure.
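One common way to build a full adder from two 4×1 multiplexers is sketched below. The paper does not list the exact wiring of its MUX-based full adder, so the input assignment here is an assumption: the operand bits A and B drive the selection lines S1 and S0, and the data inputs are tied to Cin, its inverse, or constants. The function names are hypothetical.

```python
def mux4(i0: int, i1: int, i2: int, i3: int, s1: int, s0: int) -> int:
    """4x1 multiplexer: output Y is the data input selected by (S1, S0)."""
    return (i0, i1, i2, i3)[(s1 << 1) | s0]


def mux_full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Full adder from two 4x1 MUXes with A and B as selection lines.

    Assumed wiring (not taken from the paper):
      Sum  MUX data inputs: Cin, NOT Cin, NOT Cin, Cin
      Cout MUX data inputs: 0,   Cin,     Cin,     1
    """
    not_cin = cin ^ 1
    s = mux4(cin, not_cin, not_cin, cin, a, b)
    cout = mux4(0, cin, cin, 1, a, b)
    return s, cout
```

With this wiring the only extra logic besides the two MUXes is a single inverter on Cin, whose delay the analysis above ignores.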

Based on the reference full-parallel architecture, let us consider a design alternative that offers different trade-offs between area and throughput [15]. A single-MAC FIR filter (SMFF) uses the components of the reference FPFF but involves only one MAC unit. The multiply and accumulate process is repeated K times, and the final two PPs are added using the RCA.

IV. Discussion

The area complexity of the proposed filter is approximately proportional to the input bit-width and the number of taps. As the number of taps of the FIR filter increases, the throughput decreases. An FIR filter is normally used as part of a system rather than as a complete system by itself. FIR filter designs targeting integrated circuit implementations such as FPGAs or ASICs use fixed-point rather than floating-point arithmetic to reduce implementation complexity and overall chip cost. Matrix multiplications and discrete transforms basically consist of many dot product operations, but their high complexity makes it difficult to achieve high throughput. Compared with the previous work, area, delay and power consumption are reduced in this paper by using a MUX-based full adder in place of the conventional one in the FIR filter.

V. Simulation results

In this paper, we coded all the proposed FIR filters, including the somewhat more complex SMFF and FDFF variants. Compared with the previous work, area, delay and power consumption are reduced. The software used in this paper is Xilinx Vivado 2018.3; the design is also applicable to Synopsys tools.

VI. Applications of the Proposed method

The proposed method can be used in a variety of high-speed applications and digital modules that are mainly based on FIR filters:

It can be used in convolutional neural networks (CNNs) for classification, segmentation and other auto-correlated data.
It can be used in imaging applications for image transformation and data interpretation.
It can be used in digital signal processing applications such as audio and speech processing, RADAR, SONAR and voice recognition.

FIR filters are also used in audio systems, control systems, medical devices, deep neural networks, etc.

Conclusion

In this paper, we implemented full-parallel FIR filters that can provide very high throughput. The proposed method starts from a full-parallel design based on MBEs, a hierarchical Wallace tree network and a ripple carry adder. The propagation delay of the FIR filter has been analyzed at the unit gate level, and a novel full adder consisting of multiplexers has been implemented. As a result, the proposed full-parallel design achieves very high throughput along with reduced area, delay and power compared with an FIR filter using a conventional full adder. The main significance of this paper is that the proposed FIR filter can scale to applications that require very high throughput rates.

References

[1] Eghbali, H. Johansson, O. Gustafsson, and S. J. Savory, "Optimal least-squares FIR digital filters for compensation of chromatic dispersion in digital coherent optical receivers," J. Lightw. Technol., vol. 32, no. 8, pp. 1449–1456, Apr. 2014.
[2] D. Kang, Y. Kang, and Y. Hong, "VLSI implementation of fractional motion estimation interpolation for high efficiency video coding," Electron. Lett., vol. 51, no. 15, pp. 1163–1165, Jul. 2015.
[3] M. S. Hosseini and K. N. Plataniotis, "High-accuracy total variation with application to compressed video sensing," IEEE Trans. Image Process., vol. 23, no. 9, pp. 3869–3884, Sep. 2014.
[4] J. Wang, J. Lin, and Z. Wang, "Efficient hardware architectures for deep convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 6, pp. 1941–1953, Jun. 2018.
[5] A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, "An architecture to accelerate convolution in deep neural networks," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 4, pp. 1349–1362, Apr. 2018.
[6] P. Meher and S. Park, "Design of cascaded CORDIC based on precise analysis of critical path," Electronics, vol. 8, no. 4, p. 382, Mar. 2019.
[7] P. K. Meher and M. Maheshwari, "A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2011, pp. 121–124.
[8] S. K. Patel and S. K. Singhal, "Area–delay and energy efficient multioperand binary tree adder," IET Circuits, Devices Syst., vol. 14, no. 5, pp. 586–593, Aug. 2020.
[9] L. Dadda and V. Piuri, "Pipelined adders," IEEE Trans. Comput., vol. 45, no. 3, pp. 348–356, Mar. 1996.
[10] S.-R. Kuang, J.-P. Wang, and C.-Y. Guo, "Modified booth multipliers with a regular partial product array," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 5, pp. 404–408, May 2009.
[11] J.-Y. Kang and J.-L. Gaudiot, "A simple high-speed multiplier design," IEEE Trans. Comput., vol. 55, no. 10, pp. 1253–1258, Oct. 2006.
[12] A. Fathi, B. Mashoufi, and S. Azizian, "Very fast, high-performance 5-2 and 7-2 compressors in CMOS process for rapid parallel accumulations," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 6, pp. 1403–1412, Jun. 2020.
[13] T.-B. Juang, P. K. Meher, and K.-S. Jan, "High-performance logarithmic converters using novel two-region bit-level manipulation schemes," in Proc. Int. Symp. VLSI Design, Autom. Test, Apr. 2011, pp. 1–4.
[14] C. Cheng and K. K. Parhi, "Hardware efficient fast parallel FIR filter structures based on iterated short convolution," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1492–1500, Aug. 2004.
[15] S. Y. Park and P. K. Meher, "Low-power, high-throughput, and low-area adaptive FIR filter based on distributed arithmetic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 6, pp. 346–350, Jun. 2013.

Copyright

Copyright © 2022 Sameena Mohammad, K. Rama Devi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET46913

Publish Date : 2022-09-28

ISSN : 2321-9653

Publisher Name : IJRASET